ABSTRACT

The purpose of this paper is to prove that one currently recommended
method of obtaining the reliability of an instrument defined on a
population of aggregate units is invalid. This method randomly splits
each aggregate into two halves, correlates the two half unit scores by
a Pearson product moment correlation coefficient, and corrects the
correlation coefficient using the Spearman-Brown prophecy formula. Our
approach was to compare this procedure to the standard method of
forming random split halves of items on the test. In addition, the
reliability of an instrument was obtained by both methods. It was found
that the currently recommended method underestimates the reliability of
a test defined on an aggregate. (Author)

In educational research and evaluation the unit of analysis is
frequently some aggregate of smaller units. A popular example is the use
of classrooms, where an observation on a classroom is defined by some
function of the observations on the students in the classroom. The
purpose of the present paper is to consider the problem of estimating the
reliability of a test attendant to the use of aggregate units. More
specifically, we prove that a currently recommended method of estimating
the reliability of a test defined on a population of aggregate units is
invalid. Our discussion is limited to the situation where individuals
are measured by a unidimensional test and observations on aggregate
units are defined by the mean of the observations of individuals comprising
the aggregate units.

The paper proceeds by first considering the relationship between the
reliability of a test for a population of aggregate units and for the
population of individuals used to form those aggregate units. Next, we
define the method of estimating reliability that is shown to be invalid.
The analytic demonstration of invalidity is supplemented by a numerical
example.

The reliability of a test for aggregate units 

It is well known that the reliability of an instrument can vary across
populations for which the instrument may be used. Even when the set of
individuals is held constant, the choice of unit of analysis represents a
further definition of the population. It follows that for a given set
of children, the reliability of a test for the population of children
might well differ from the reliability of the same test for the population
of classrooms in which the children experience their schooling. Similarly,
the reliability for this population of classrooms might differ from the
reliability for the population of schools in which the classrooms are
nested.

Shaycoft (1963) has investigated the relationship between the
reliability of a test for a population of individuals and its reliability
for a population of aggregate units formed by those individuals. She
pointed out that the two reliabilities will be equal if the aggregate units
are formed randomly. The reliability of a test for aggregate units will
be greater than for individuals when the variance of the aggregate unit
means is greater than what would be expected by random grouping.
Although this is typically the case in education, she goes on to say that
the reverse will be true when the variance of the aggregate means is less
than would be expected from random grouping. The size of the difference
between the two reliabilities is a function of

1) the degree of departure from randomness,
2) the number of individuals in each aggregate,
3) the size of the reliability defined on individuals.

The invalid method of estimating reliability

The method to be considered for estimating the reliability of an
instrument for a population of aggregate units can be described using
schools as an example. First, randomly split each school into two halves
and obtain a score on the instrument for each random half. Then, calculate
the correlation between the two half unit scores by a Pearson product
moment correlation coefficient. The reliability defined on schools is
obtained by correcting the correlation coefficient using the
Spearman-Brown prophecy formula.
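
The three steps just described can be sketched in a few lines of Python.
This is only an illustrative sketch, not code from the paper: the data
layout (one list of full-test scores per school), the function names, and
the made-up scores are all assumptions, and each school is assumed to have
at least two students.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r):
    """Step a half-length correlation up to full length."""
    return 2 * r / (1 + r)


def split_unit_reliability(units, rng):
    """units: one list of full-test scores per school (hypothetical layout)."""
    first, second = [], []
    for scores in units:
        shuffled = list(scores)
        rng.shuffle(shuffled)              # random split of the school's students
        half = len(shuffled) // 2
        first.append(mean(shuffled[:half]))    # half unit score, first half
        second.append(mean(shuffled[half:]))   # half unit score, second half
    return spearman_brown(pearson(first, second))


# Made-up scores for three schools, for illustration only.
schools = [[4, 6, 5, 7], [9, 8, 10, 9], [2, 3, 1, 2]]
estimate = split_unit_reliability(schools, random.Random(0))
```

The random number generator is passed in explicitly so that a particular
random split can be reproduced.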

Our first exposure to the above described method of estimating
reliabilities was during the second author's participation in a consultant
panel conference on the evaluation of the Follow Through Program. At that
conference the method was suggested for estimating reliabilities of
pretests where school was the unit of analysis. The reliabilities were
needed for subsequent corrections to be made in analyses of covariance.
Later we discovered that the procedure had been used, except for the part
involving the Spearman-Brown correction, by Dyer, Linn, and Patton (1969)
as a method for estimating the reliability of a test defined on a
population of school systems. O'Connor (1972) used the Dyer et al.
reliabilities in an example, but first corrected them using the
Spearman-Brown formula to obtain estimates of the parallel forms
reliabilities based on the full school systems. Since the procedure for
estimating the reliability of a test defined on a population of aggregate
units has enjoyed some popularity, it is of interest to investigate the
properties of the procedure.

Analytic Demonstration

Our general approach was to compare the procedure for estimating
reliability under investigation to the standard method of forming
random split halves of items on the test, where a school's score on a
split half of the test is the mean score for the students in the school.
Where the two procedures are not in agreement the former is considered in
error.
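
For comparison, the standard split test procedure can be sketched in the
same way. Again this is a hedged illustration rather than the authors'
code: the nested layout `units[u][s][i]` (unit, student, item), the
function names, and the toy data are assumptions introduced here.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r):
    """Step a half-length correlation up to full length."""
    return 2 * r / (1 + r)


def split_test_reliability(units, rng):
    """units[u][s][i]: score of student s in unit u on item i (hypothetical)."""
    n_items = len(units[0][0])
    items = list(range(n_items))
    rng.shuffle(items)                     # random split of the items
    half_a, half_b = items[: n_items // 2], items[n_items // 2:]

    def unit_means(item_subset):
        # A unit's half-test score is the mean half-test score of its students.
        return [
            mean([sum(student[i] for i in item_subset) for student in unit])
            for unit in units
        ]

    return spearman_brown(pearson(unit_means(half_a), unit_means(half_b)))


# Toy data: three units, two students each, four items (illustration only).
toy_units = [
    [[0, 0, 0, 0], [1, 1, 1, 1]],
    [[1, 1, 1, 1], [1, 1, 1, 1]],
    [[0, 0, 0, 0], [0, 0, 0, 0]],
]
toy_estimate = split_test_reliability(toy_units, random.Random(0))
```

Here the items are split and every student contributes to both half
scores, whereas the split units procedure splits the students and uses
the full test in both halves.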

Starting with the split units procedure, the correlation between
half unit scores on the full test is by definition

$$r_{X'_1 X'_2} = \frac{\sum^{N} x'_1 x'_2}{N \sigma_{X'_1} \sigma_{X'_2}} , \tag{1}$$

where $x'_1$ and $x'_2$ are deviation half unit scores on the full test
for the two sets of halves, $\sigma_{X'_1}$ and $\sigma_{X'_2}$ are the
two standard deviations,* and $N$ is the number of units. Assuming the two
standard deviations to be equal (which is the long run expectation) and
that true scores and errors of measurement are independent for half units,

$$r_{X'_1 X'_2} = \frac{\sum^{N} t'_1 t'_2}{N \sigma^2_{X'_1}} , \tag{2}$$

where $t'_1$ and $t'_2$ are true deviation half unit scores on the full
test. Further, the correlation between the true half unit scores on the
full test is by definition

$$r_{T'_1 T'_2} = \frac{\sum^{N} t'_1 t'_2}{N \sigma_{T'_1} \sigma_{T'_2}} ,$$

which simplifies to

$$r_{T'_1 T'_2} = \frac{\sum^{N} t'_1 t'_2}{N \sigma^2_{T'_1}} , \tag{3}$$

given the assumption that $\sigma_{T'_1} = \sigma_{T'_2}$. By way of
equation (3), equation (2) becomes

$$r_{X'_1 X'_2} = \frac{r_{T'_1 T'_2}\, \sigma^2_{T'_1}}{\sigma^2_{X'_1}} . \tag{4}$$

* $\sigma$ is not a parameter.

Now $\sigma^2_{X'_1}$ and $\sigma^2_{T'_1}$ need to be defined in terms of
half test for full unit statistics. First, consider $\sigma^2_{X'_1}$.
Letting $X_1$ and $X_2$ denote half-test scores for full units and assuming
that the variances of the two half test scores on full units are equal,
$\sigma^2_{X_1} = \sigma^2_{X_2}$, it follows that the variance of the
full test for the full units is

$$\sigma^2_X = 2\sigma^2_{X_1} + 2 r_{X_1 X_2} \sigma^2_{X_1} , \tag{5}$$

where $r_{X_1 X_2}$ denotes the correlation between half-test scores for
full units. But the variance of the full test for full units is also

$$\sigma^2_X = \tfrac{1}{4}\left(2\sigma^2_{X'_1} + 2 r_{X'_1 X'_2} \sigma^2_{X'_1}\right) , \tag{6}$$

since

$$X = \frac{X'_1 + X'_2}{2} .$$

Using equations (5) and (6),

$$2\sigma^2_{X_1} + 2 r_{X_1 X_2} \sigma^2_{X_1} = \tfrac{1}{4}\left(2\sigma^2_{X'_1} + 2 r_{X'_1 X'_2} \sigma^2_{X'_1}\right) ,$$

from which it follows that

$$\sigma^2_{X'_1} = \frac{4\sigma^2_{X_1}\left(1 + r_{X_1 X_2}\right)}{1 + r_{X'_1 X'_2}} . \tag{7}$$

A similar strategy can be used to define $\sigma^2_{T'_1}$ in terms of
half test full unit statistics. Letting $T_1$ and $T_2$ denote true
half-test scores for full units and assuming that the variances of the two
sets of true half test scores are equal, $\sigma^2_{T_1} = \sigma^2_{T_2}$,
it follows that the variance of the true scores for the full test is

$$\sigma^2_T = 2\sigma^2_{T_1} + 2 r_{T_1 T_2} \sigma^2_{T_1} , \tag{8}$$

where $r_{T_1 T_2}$ is the correlation between $T_1$ and $T_2$. By
classical measurement theory $r_{T_1 T_2}$ equals one, so that equation (8)
becomes

$$\sigma^2_T = 4\sigma^2_{T_1} . \tag{9}$$

But the variance of true scores for the full test is also

$$\sigma^2_T = \tfrac{1}{4}\left(2\sigma^2_{T'_1} + 2 r_{T'_1 T'_2} \sigma^2_{T'_1}\right) , \tag{10}$$

where again the prime indicates that the statistics are for half units on
the full test. Using equations (9) and (10),

$$4\sigma^2_{T_1} = \tfrac{1}{4}\left(2\sigma^2_{T'_1} + 2 r_{T'_1 T'_2} \sigma^2_{T'_1}\right) ,$$

from which it follows that

$$\sigma^2_{T'_1} = \frac{8\sigma^2_{T_1}}{1 + r_{T'_1 T'_2}} . \tag{11}$$

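Returning to equation (4) and using the definitions provided by equations
(7) and (11),

$$r_{X'_1 X'_2} = r_{T'_1 T'_2} \cdot \frac{8\sigma^2_{T_1} / \left(1 + r_{T'_1 T'_2}\right)}{4\sigma^2_{X_1}\left(1 + r_{X_1 X_2}\right) / \left(1 + r_{X'_1 X'_2}\right)} ,$$

which reduces to

$$r_{X'_1 X'_2} = \frac{2 r_{T'_1 T'_2}\, \sigma^2_{T_1} \left(1 + r_{X'_1 X'_2}\right)}{\left(1 + r_{T'_1 T'_2}\right) \sigma^2_{X_1} \left(1 + r_{X_1 X_2}\right)} .$$

Solving for $r_{X'_1 X'_2}$,

$$r_{X'_1 X'_2} = \frac{2 r_{T'_1 T'_2}\, \sigma^2_{T_1}}{\left(1 + r_{T'_1 T'_2}\right) \sigma^2_{X_1} \left(1 + r_{X_1 X_2}\right) - 2 r_{T'_1 T'_2}\, \sigma^2_{T_1}} . \tag{12}$$

Since $r_{X_1 X_2}$ is the reliability of the half test for full units, it
follows that

$$r_{X_1 X_2} = \frac{\sigma^2_{T_1}}{\sigma^2_{X_1}} . \tag{13}$$

Substituting the definition provided by equation (13) into equation (12),

$$r_{X'_1 X'_2} = \frac{2 r_{T'_1 T'_2}\, r_{X_1 X_2}}{\left(1 + r_{T'_1 T'_2}\right)\left(1 + r_{X_1 X_2}\right) - 2 r_{T'_1 T'_2}\, r_{X_1 X_2}} ,$$

which reduces to

$$r_{X'_1 X'_2} = \frac{2 r_{X_1 X_2}\, r_{T'_1 T'_2}}{2 - \left(1 - r_{X_1 X_2}\right)\left(1 - r_{T'_1 T'_2}\right)} . \tag{14}$$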

From equation (14) it follows that the two procedures for estimating
reliability yield identical results when the correlation between the
true half unit scores on the full test, $r_{T'_1 T'_2}$, equals one. Given
random splits on the units, $r_{T'_1 T'_2}$ will equal one only when the
standard error of the difference between the true score means of each pair
of half units is zero. The standard errors have expectations greater than
zero for schools of finite size. Thus for practical situations the split
unit procedure does not yield results identical to the split test
procedure.
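
A small numerical sketch may make the behavior of equation (14) concrete.
The function name and the grid of true half unit correlations below are
illustrative assumptions; the half-test correlation of .82 is the value
reported in the paper's example.

```python
def split_unit_correlation(r_x, r_t):
    """Equation (14): correlation between half unit scores on the full test.

    r_x: correlation between half-test scores for full units.
    r_t: correlation between true half unit scores on the full test.
    """
    return 2 * r_x * r_t / (2 - (1 - r_x) * (1 - r_t))


r_x = 0.82                         # half-test correlation from the example
for r_t in (1.0, 0.8, 0.5, 0.2):   # illustrative values, not from the paper
    r_split = split_unit_correlation(r_x, r_t)
    print(f"r_t = {r_t:.1f}  ->  split unit correlation = {r_split:.3f}")
```

Only the first line of output, where the true half unit correlation is
one, reproduces the split test value; every smaller value gives a lower
split unit correlation.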

Since we know that the estimation procedure under investigation is
not in agreement with the standard, it is of interest to describe the
nature of their lack of agreement. Our approach was to consider relative
error (RERR), where RERR is defined as

$$\text{RERR} = \frac{\text{true estimate} - \text{new estimate}}{\text{true estimate}} .$$

Defining $r_{X_1 X_2}$ as the true estimate and $r_{X'_1 X'_2}$ as the new
estimate, we obtain

$$\text{RERR} = \frac{r_{X_1 X_2} - \dfrac{2 r_{X_1 X_2}\, r_{T'_1 T'_2}}{2 - \left(1 - r_{X_1 X_2}\right)\left(1 - r_{T'_1 T'_2}\right)}}{r_{X_1 X_2}} .$$

This reduces to

$$\text{RERR} = \frac{\left(1 - r_{T'_1 T'_2}\right)\left(1 + r_{X_1 X_2}\right)}{2 - \left(1 - r_{X_1 X_2}\right)\left(1 - r_{T'_1 T'_2}\right)} . \tag{15}$$

Note that when $r_{T'_1 T'_2} = 1.00$, RERR = 0, which agrees with our
earlier finding. In order to find when RERR is a maximum, we took the
derivative with respect to $r_{X_1 X_2}$. The values of $r_{X_1 X_2}$ that
make the derivative zero give the points of $r_{X_1 X_2}$ where RERR is
maximized. We found that there are no maximums or minimums except at the
endpoints. Since $r_{X_1 X_2}$ is bounded by 0 and 1, we found for
$r_{T'_1 T'_2} > 0$ that

$$\frac{1 - r_{T'_1 T'_2}}{1 + r_{T'_1 T'_2}} \le \text{RERR} \le 1 - r_{T'_1 T'_2} ,$$

and for $r_{T'_1 T'_2} < 0$ that

$$\frac{1 - r_{T'_1 T'_2}}{1 + r_{T'_1 T'_2}} \ge \text{RERR} \ge 1 - r_{T'_1 T'_2} .$$

Also, since RERR is positive for all values of $r_{X_1 X_2}$ whenever
$r_{T'_1 T'_2}$ is less than one, it follows that
$r_{X'_1 X'_2} < r_{X_1 X_2}$.

Since for all practical situations the correlation between half
unit scores for the full test has been shown to be less than the
correlation between half-test scores for the full units, their
Spearman-Brown corrected counterparts must maintain the same inequality.
The conclusion is that the split units method provides an underestimate of
the reliability of a test defined on a population of aggregate units.
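
The bounds on RERR can be checked numerically. The sketch below is an
illustration under stated assumptions, not part of the original analysis:
the function names, the choice of a true half unit correlation of .5, and
the grid of half-test correlations are all introduced here.

```python
def split_unit_correlation(r_x, r_t):
    """Equation (14)."""
    return 2 * r_x * r_t / (2 - (1 - r_x) * (1 - r_t))


def relative_error(r_x, r_t):
    """Equation (15): relative error of the split unit correlation."""
    return (1 - r_t) * (1 + r_x) / (2 - (1 - r_x) * (1 - r_t))


r_t = 0.5                                  # illustrative value only
lower = (1 - r_t) / (1 + r_t)              # RERR at r_x = 0
upper = 1 - r_t                            # RERR at r_x = 1
grid = [i / 10 for i in range(11)]         # r_x from 0.0 to 1.0
values = [relative_error(r_x, r_t) for r_x in grid]

# For a positive r_t, RERR rises steadily from the lower to the upper bound.
inside_bounds = all(lower - 1e-12 <= v <= upper + 1e-12 for v in values)
increasing = all(a < b for a, b in zip(values, values[1:]))
print(inside_bounds, increasing)
```

The steady increase across the grid is consistent with the statement that
RERR has no interior maximum or minimum in the half-test correlation.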

Example

In order to illustrate the inequality of the two procedures for
estimating the reliability of a test for a population of aggregate units,
we used data on children in 35 classrooms ranging in size from 6 to 17
children. The basic data consisted of children's responses to the
thirteen items on Part A of the Reading Subtest of the MAT Primary Level II,
Form F. The children were second graders tested in the spring of 1973.

A table of random numbers was used to split each class into two
halves; then half class means on the full test were calculated. The
mean and variance of the half class means for one set of half classes were
6.29 and 3.02 respectively, while the mean and variance of the other set
of half classes were 6.40 and 2.72 respectively. The near equality of the
two variances supports the practical utility of the corresponding assumption
of equal variances made in the previous analytic demonstration. The
correlation between the two sets of half class means was .17. The
Spearman-Brown correction yields the value .29.

A table of random numbers was also used to split the test into two
halves; then full class means on the half tests were calculated. The mean
and variance of the full class means for one half of the test were 2.62
and .38 respectively, while the mean and variance for the other half of
the test were 3.70 and .56 respectively. Again the two variances were
nearly equal, which supported the corresponding assumption made previously.
For longer tests or tests with an even number of items the assumption of
equal half test variances is even more likely. The correlation between the
two half tests was .82, which became .90 using the Spearman-Brown correction.

Thus for the example the discrepancy between the two procedures for
estimating reliability was substantial and in the predicted direction.

A secondary interest was to use the data to provide an example of
the difference between the reliability of a test for aggregate units and
the same test for the individuals comprising those aggregate units. Using
the same split of the test as previously, the correlation between the two
halves for children was .41, which became .58 using the Spearman-Brown
correction.
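
The three Spearman-Brown corrections reported in this example can be
reproduced directly from the stated correlations. The snippet is a minimal
arithmetic check; the function name and labels are illustrative.

```python
def spearman_brown(r):
    """Spearman-Brown prophecy formula for doubling test (or unit) length."""
    return 2 * r / (1 + r)


reported = {
    "split units (half class means)": 0.17,
    "split test (full class means)": 0.82,
    "split test (individual children)": 0.41,
}
for label, r in reported.items():
    print(f"{label}: {r:.2f} -> {spearman_brown(r):.2f}")
# split units (half class means): 0.17 -> 0.29
# split test (full class means): 0.82 -> 0.90
# split test (individual children): 0.41 -> 0.58
```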

Conclusions

When the unit of analysis is some aggregate unit, the reliability of a
test should be reported for the population of aggregate units rather than
for the population of individuals which form those units. In theory the
size of the reliabilities for the two populations of units can differ in
either direction, but in educational research the reliability defined on the
population of aggregate units will typically be the larger.

The procedure of estimating the reliability of a test for aggregate
units by forming split units systematically underestimates the reliability
and so should not be used. One acceptable method for estimating the
reliability of a test for aggregate units parallels the familiar split test
method. Shaycoft (1963) has provided other estimation procedures that are
a function of the reliability of the test for the population of individuals
on which the aggregate units are defined.

The utility of our finding can be illustrated by an example. When
an educational researcher is attempting to "tease out" causal
relationships where random assignment has not been employed, he sometimes
uses partial correlations or estimated true scores analysis of covariance
(Porter, 1973). For the former, the correlations of the variable being
controlled with the other variables should be corrected for attenuation
(Kahneman, 1965). For the latter, the reliability of the covariate can be
used in estimated true scores analyses of covariance (Porter, 1974). When
the unit of analysis represents some aggregate of smaller units, the
reliabilities used for the corrections should be defined on the population
of aggregate units. The method investigated here would provide reliability
coefficients which are too small and thus cause the statistical analyses
to over-correct for the control variable.
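
The size of the over-correction can be illustrated with the two
reliability estimates from the example. The sketch assumes the familiar
single-variable form of the correction for attenuation (dividing a
correlation by the square root of the control variable's reliability),
which is one common form and not necessarily the exact adjustment used in
the analyses cited above. The observed correlation of .40 is hypothetical.

```python
from math import sqrt


def correct_for_attenuation(r_xy, rel_x):
    """Disattenuate r_xy for unreliability in x only: r_xy / sqrt(rel_x)."""
    return r_xy / sqrt(rel_x)


observed = 0.40          # hypothetical correlation with the control variable
split_unit_rel = 0.29    # split units estimate from the example
split_test_rel = 0.90    # split test estimate from the example

over_corrected = correct_for_attenuation(observed, split_unit_rel)
corrected = correct_for_attenuation(observed, split_test_rel)
print(f"{over_corrected:.2f} vs {corrected:.2f}")
# 0.74 vs 0.42
```

Under these assumptions the understated reliability inflates the
corrected correlation from about .42 to about .74.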



References

Dyer, H. S., Linn, R. L., and Patton, M. J. A comparison of four methods
    of obtaining discrepancy measures based on observed and predicted
    school system means on achievement tests. AERJ, 1969, 6, 591-605.

Kahneman, D. Control of spurious association and the reliability of the
    controlled variable. Psychological Bulletin, 1965, 64, 326-329.

O'Connor. Test theory and the measurement of change. RER, Winter 1972, 42, 1.

Porter, A. C. Analysis strategies for some common evaluation paradigms.
    Paper presented at the meetings of the American Educational Research
    Association, 1973.

Shaycoft, M. F. The statistical characteristics of school means. In
    Flanagan, J. C., Dailey, J. T., Shaycoft, M. F., Orr, D. B., and
    Goldberg, I., Studies of the American high school. (Final report to
    the U.S. Office of Education, Cooperative Research Project No. 226),
    Washington, D.C.: Project TALENT Office, University of Pittsburgh,
    1962.

Shaycoft, M. F. The use of school means as variables. Revised version
    of a paper presented at the annual meetings of the American Psychological
    Association, Philadelphia, September, 1963.



